Introduction to Triton Programming: The Performance Paradox: Why Correct Code Is Slow

The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can actually run slower than a plain CPU loop if it fails to amortize the fixed costs of launching work on the GPU. This cost most often shows up as the Launch Tax.

1. The "Correctness" Fallacy

Functional correctness says nothing about efficiency. Your Triton code may distribute work correctly across thousands of threads, but if the total amount of work (N) is small, most of the GPU stays idle. The hardware then spends more time on fixed overhead, such as launching the kernel and scheduling its blocks, than on actual arithmetic.
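To make "underutilized" concrete, the following minimal sketch counts how many program instances a 1D launch creates and bounds the fraction of streaming multiprocessors (SMs) that can receive work. The constants `NUM_SMS` (the SM count of an NVIDIA A100) and `BLOCK_SIZE` are assumptions chosen for illustration, and `sm_utilization` is a hypothetical helper, not part of Triton.

```python
NUM_SMS = 108      # assumption: SM count of an NVIDIA A100; adjust for your GPU
BLOCK_SIZE = 1024  # assumption: elements handled by each program instance


def cdiv(a: int, b: int) -> int:
    """Ceiling division, used to size a 1D launch grid."""
    return (a + b - 1) // b


def sm_utilization(n_elements: int, block_size: int = BLOCK_SIZE,
                   num_sms: int = NUM_SMS) -> float:
    """Upper bound on the fraction of SMs that receive at least one program."""
    programs = cdiv(n_elements, block_size)
    return min(programs, num_sms) / num_sms


for n in (256, 65_536, 100_000_000):
    print(f"N={n:>11,}  programs={cdiv(n, BLOCK_SIZE):>6,}  "
          f"SMs busy <= {sm_utilization(n):.1%}")
```

Under these assumptions, N=256 produces a single program instance, so at most one SM out of 108 has anything to do, no matter how correct the kernel is.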

2. The Python Measurement Trap

Timing GPU code in Python with time.time() is dangerous. GPU calls are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the time it takes to enqueue the launch. With synchronization, you measure the full round trip, meaning host-side launch overhead plus the kernel's execution. For small workloads that overhead can be roughly 10 times larger than the kernel's own execution time.
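The effect can be reproduced without a GPU. The sketch below is only an analogy: `FakeDevice` is a hypothetical toy class whose `launch()` returns immediately while the "kernel" (a 50 ms sleep, an arbitrary and exaggerated duration) runs on a background thread, and whose `synchronize()` plays the role that torch.cuda.synchronize() plays on real hardware.

```python
import threading
import time

KERNEL_SECONDS = 0.05  # assumption: pretend kernel duration, exaggerated for clarity


class FakeDevice:
    """Toy stand-in for a GPU queue: launch() returns at once, work runs in the background."""

    def __init__(self):
        self._pending = []

    def launch(self):
        worker = threading.Thread(target=time.sleep, args=(KERNEL_SECONDS,))
        worker.start()
        self._pending.append(worker)

    def synchronize(self):
        # Block until every launched "kernel" has finished.
        for worker in self._pending:
            worker.join()
        self._pending.clear()


def timed_launch(device, sync):
    """Time one launch, optionally waiting for completion before stopping the clock."""
    start = time.perf_counter()
    device.launch()
    if sync:
        device.synchronize()
    elapsed = time.perf_counter() - start
    device.synchronize()  # drain leftover work so consecutive runs do not overlap
    return elapsed


device = FakeDevice()
unsynced = timed_launch(device, sync=False)
synced = timed_launch(device, sync=True)
print(f"without synchronize: {unsynced * 1e3:.2f} ms (enqueue time only)")
print(f"with synchronize:    {synced * 1e3:.2f} ms (launch plus execution)")
```

On a real GPU the same pattern applies: call torch.cuda.synchronize() before starting the clock and again before stopping it. Event-based timers such as torch.cuda.Event(enable_timing=True) or triton.testing.do_bench are alternatives designed for this purpose.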

3. Latency vs. Throughput

To overcome the paradox, you must supply enough work to "hide" the launch latency. Doing so moves the kernel from a latency-bound regime (limited by the CPU-to-GPU launch path) to a throughput-bound regime (limited by the GPU's memory bandwidth or compute).
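A rough cost model shows where that transition happens for float32 vector addition. The sketch assumes a fixed launch overhead of about 5 microseconds and a memory bandwidth of about 1 TB/s. Both numbers are illustrative placeholders rather than measurements, and the function names are hypothetical.

```python
LAUNCH_OVERHEAD_S = 5e-6   # assumption: ~5 microseconds of fixed cost per launch
BANDWIDTH_B_PER_S = 1e12   # assumption: ~1 TB/s of device memory bandwidth
BYTES_PER_ELEMENT = 12     # float32 out = x + y: two 4-byte loads, one 4-byte store


def estimated_time(n: int) -> float:
    """Fixed launch cost plus the time to move the data through memory."""
    return LAUNCH_OVERHEAD_S + n * BYTES_PER_ELEMENT / BANDWIDTH_B_PER_S


def launch_fraction(n: float) -> float:
    """Share of the estimated time spent on the launch itself."""
    return LAUNCH_OVERHEAD_S / estimated_time(n)


def crossover_n() -> float:
    """N at which data movement takes exactly as long as the launch."""
    return LAUNCH_OVERHEAD_S * BANDWIDTH_B_PER_S / BYTES_PER_ELEMENT


for n in (256, 1_000_000, 100_000_000):
    print(f"N={n:>11,}  time~{estimated_time(n) * 1e6:9.2f} us  "
          f"launch share={launch_fraction(n):.1%}")
print(f"crossover near N = {crossover_n():,.0f}")
```

With these placeholder values the launch accounts for nearly all of the time at N=256 and well under 1% at N=10^8. The crossover sits at a few hundred thousand elements, which is why batching or larger inputs are the usual way to amortize the Launch Tax.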

QUESTION 1

For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).

N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch

N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic

N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch

All are compute-bound.

QUESTION 2

In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?

Arithmetic Throughput

Memory Bandwidth

L1 Cache Size

QUESTION 3

What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?

The GPU and CPU always finish at the same time.

The CPU continues to the next line of code before the GPU kernel finishes.

The kernel runs faster on smaller GPUs.

Memory transfers are blocked by compute.

QUESTION 4

Why does $out = x + y$ exhibit low arithmetic intensity?

It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.

The addition operation is too complex for the ALUs.

It requires shared memory synchronization.

It only runs on one SM.

QUESTION 5

How can the 'Launch Tax' be amortized in a real-world application?

By calling the kernel more frequently with smaller data.

By increasing the workload per launch (e.g., larger N or batching).

By using 16-bit floats instead of 32-bit floats.

By disabling the L2 cache.